
[코담] Lectures and development notes covering everything about Python and Django, from web development and hands-on projects to AI

02 Summary | ✅ Author: 이유정 (Ph.D.)

Naver Blog Crawling Methods: A Summary

Naver blog pages wrap the post in an iframe and rely on JavaScript rendering, so crawling them means picking the right tool for each situation. The notes below cover how to extract the actual post body and when to use each tool.

✅ Main tasks

Collect Naver blog reviews
Handle the iframe structure
Filter by specific keywords
Remove unwanted text (e.g. "블로그기자단", a blog press corps label) during cleaning

📁 Example directory structure

project_root/
│
├── naver_crawler/
│   ├── __init__.py
│   ├── naver_blog_scrap_selenium.py
│   └── naver_blog_scrap_request.py
│
└── notebook.ipynb  # Jupyter practice notebook

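📌 Tool characteristics and when to use each

1. `requests`

Use: fetching static HTML, JSON, XML, and similar content directly from a server
Limitation: cannot collect a DOM that is produced by JavaScript rendering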

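import requests

# Fetch a JSON endpoint; fail fast on network stalls and HTTP errors.
response = requests.get("https://api.example.com/data", timeout=10)
response.raise_for_status()
data = response.json()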

2. `BeautifulSoup`

Use: parsing HTML/XML → extracting the tags and text you need
Limitation: cannot fetch pages on its own (use it together with requests)

from bs4 import BeautifulSoup

# `html` is an HTML string obtained earlier (for example, response.text from requests).
soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", class_="title").text
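The snippet above assumes an `html` string already exists. As a small self-contained sketch, the same calls can be tried on an inline document; the HTML, the class names in it, and the variable names here are made up for illustration. It also shows that `find()` returns `None` when nothing matches, which is worth checking before reading `.text`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a fetched page.
html = """
<html><body>
  <div class="title">Sample post title</div>
  <div class="se-main-container"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
title_tag = soup.find("div", class_="title")
title = title_tag.get_text(strip=True) if title_tag else None

# find_all() returns every match inside the container, in document order.
container = soup.find("div", class_="se-main-container")
paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]

# A selector that matches nothing yields None rather than raising.
missing = soup.find("div", class_="no-such-class")

print(title)
print(paragraphs)
```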

3. `selenium`

Use: rendering JS-driven dynamic pages and simulating user interaction
Good fit: data inside an iframe, or pages that need clicks, login, or scrolling
Drawback: slow and resource-intensive

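from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com")
    # Click a button so the page loads its dynamic content.
    driver.find_element(By.ID, "loadButton").click()
    html = driver.page_source
finally:
    driver.quit()  # always close the browser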

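🧼 Text cleaning function (cleaning)

import re
import unicodedata

def clean_text(text):
    # Turn whitespace into plain spaces first: \n, \t and \r are control
    # characters (category Cc) and would otherwise be deleted, gluing words together.
    text = re.sub(r"\s", " ", text)
    # Drop the remaining control/format characters (e.g. zero-width spaces).
    text = "".join(c for c in text if not unicodedata.category(c).startswith("C"))
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", text).strip()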
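The order of the two cleaning steps matters, because newline and tab are themselves control characters (Unicode category Cc). The sketch below compares both orders on a made-up sample string; the two helper names are hypothetical and exist only for this comparison.

```python
import re
import unicodedata


def strip_controls_first(text):
    """Delete category-C characters, then collapse whitespace."""
    text = "".join(c for c in text if not unicodedata.category(c).startswith("C"))
    text = re.sub(r"[ \t\n\r\f\v]+", " ", text)
    return text.strip()


def normalize_whitespace_first(text):
    """Turn whitespace into spaces, delete category-C characters, then collapse."""
    text = re.sub(r"\s", " ", text)
    text = "".join(c for c in text if not unicodedata.category(c).startswith("C"))
    return re.sub(r" +", " ", text).strip()


# Sample with a newline, a zero-width space (U+200B) and a tab.
sample = "first line\nsecond\u200b line\t end"

print(strip_controls_first(sample))        # first linesecond line end
print(normalize_whitespace_first(sample))  # first line second line end
```

With the first order the newline is removed before it can become a space, so "line" and "second" run together. Both orders remove the zero-width space.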

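🧭 Collecting the blog post body with Selenium

# Example filename: naver_crawler/naver_blog_scrap_selenium.py
import re
import unicodedata

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager


def clean_text(text):
    # Turn whitespace into plain spaces first: \n, \t and \r are control
    # characters (category Cc) and would otherwise be deleted, gluing words together.
    text = re.sub(r"\s", " ", text)
    # Drop the remaining control/format characters (e.g. zero-width spaces).
    text = "".join(c for c in text if not unicodedata.category(c).startswith("C"))
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", text).strip()


def crawl_naver_blog(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)

        # The post body lives inside the "mainFrame" iframe, so switch into it first.
        iframe_element = driver.find_element(By.ID, "mainFrame")
        driver.switch_to.frame(iframe_element)

        # Raises NoSuchElementException if the post area is missing.
        driver.find_element(By.ID, "post-area")
        res = driver.page_source
    finally:
        # Always release the browser, even when a lookup above fails.
        driver.quit()

    soup = BeautifulSoup(res, "html.parser")
    # Try the "se-main-container" body first, then fall back to "post-area".
    content = soup.find("div", {"class": "se-main-container"})
    if content is None:
        content = soup.find("div", {"id": "post-area"})
    if content is None:
        raise ValueError(f"post body not found: {url}")

    # Remove the span that carries the "블로그기자단" label.
    span_tag = content.find("span", string=re.compile("블로그기자단"))
    if span_tag:
        span_tag.decompose()
    return clean_text(content.text)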
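Looking up `mainFrame` immediately after `driver.get()` can fail if the frame has not been attached yet. A possible variation, sketched here under the assumption that the same element IDs apply, uses Selenium's explicit waits. The function name `get_post_html` and the 10-second default are hypothetical choices, and the caller is assumed to create and quit the driver.

```python
def get_post_html(driver, url, timeout=10):
    """Return the page source of the post iframe, waiting for it to load.

    `driver` is an already-created Selenium WebDriver; the caller is
    responsible for calling driver.quit() afterwards.
    """
    # Imported inside the function so that importing this module does not
    # require Selenium until the function is actually called.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    wait = WebDriverWait(driver, timeout)

    # Wait until the iframe exists, then switch into it in one step.
    wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "mainFrame")))

    # Wait until either known body container is present inside the frame.
    wait.until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "div.se-main-container, #post-area")
        )
    )
    return driver.page_source
```

If either wait runs out of time, Selenium raises `TimeoutException`, which is easier to diagnose than parsing a half-loaded page.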

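⚡ Crawling the iframe with requests (no JS rendering needed)

# Example filename: naver_crawler/naver_blog_scrap_request.py
import re
import unicodedata
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def clean_text(text):
    # Turn whitespace into plain spaces first: \n, \t and \r are control
    # characters (category Cc) and would otherwise be deleted, gluing words together.
    text = re.sub(r"\s", " ", text)
    # Drop the remaining control/format characters (e.g. zero-width spaces).
    text = "".join(c for c in text if not unicodedata.category(c).startswith("C"))
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", text).strip()


def crawl_naver_blog_by_requests(url):
    res = requests.get(url, timeout=10)
    res.raise_for_status()
    root_url = "https://blog.naver.com"

    # The outer page only wraps an iframe; the post itself is served from the iframe's src.
    soup = BeautifulSoup(res.text, "html.parser")
    iframe = soup.find("iframe", {"id": "mainFrame"})
    if iframe is not None and iframe.get("src"):
        # urljoin handles both relative and absolute src values.
        res = requests.get(urljoin(root_url, iframe["src"]), timeout=10)
        res.raise_for_status()
        soup = BeautifulSoup(res.text, "html.parser")

    # Try the "se-main-container" body first, then fall back to "post-area".
    content = soup.find("div", {"class": "se-main-container"})
    if content is None:
        content = soup.find("div", {"id": "post-area"})
    if content is None:
        raise ValueError(f"post body not found: {url}")

    # Remove the span that carries the "블로그기자단" label.
    span_tag = content.find("span", string=re.compile("블로그기자단"))
    if span_tag:
        span_tag.decompose()
    return clean_text(content.text)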
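The key step in the requests approach is turning the iframe's `src` into a full URL. The offline sketch below shows that step in isolation; the HTML snippet, the blog ID, and the post number are made up, and the real `src` value may look different.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up stand-in for the outer page of a blog post.
outer_html = """
<html><body>
  <iframe id="mainFrame" src="/PostView.naver?blogId=example_blog&amp;logNo=123"></iframe>
</body></html>
"""
page_url = "https://blog.naver.com/example_blog/123"

soup = BeautifulSoup(outer_html, "html.parser")
iframe = soup.find("iframe", id="mainFrame")

# urljoin resolves a relative src against the page URL and leaves an
# absolute src untouched, so it is safer than plain string concatenation.
inner_url = urljoin(page_url, iframe["src"]) if iframe else page_url

print(inner_url)
```

A second request to `inner_url` would then return the HTML that actually contains the post body.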

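🔍 Extracting blog reviews from Naver search results

# Example filename: naver_crawler/naver_blog_scrap_request.py
import requests
from bs4 import BeautifulSoup


def find_review_article(location: str, keyword: str):
    # Pass the query through `params` so requests URL-encodes the Korean text
    # and the spaces. "리뷰" means "review".
    params = {
        "sm": "tab_hty.top",
        "ssc": "tab.blog.all",
        "query": f"{location} {keyword} 리뷰",
    }
    res = requests.get("https://search.naver.com/search.naver", params=params, timeout=10)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, "html.parser")

    # This relies on result titles carrying the "title_link" class; if the
    # search markup changes, the list comes back empty.
    review_list = []
    for link in soup.find_all("a", class_="title_link"):
        title = link.text
        href = link.get("href")
        # Keep only results whose title contains the keyword.
        if href and keyword in title:
            review_list.append((title, href))

    return review_list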
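The search query mixes Korean text and spaces, so it has to be percent-encoded somewhere along the way. As a minimal standard-library sketch of what the encoded URL looks like, the hypothetical helper `build_blog_search_url` below builds it explicitly and then decodes it again to confirm nothing was lost.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_blog_search_url(location, keyword):
    """Build a Naver blog-search URL with a safely encoded query string."""
    params = {
        "ssc": "tab.blog.all",
        "query": f"{location} {keyword} 리뷰",
    }
    # urlencode percent-encodes non-ASCII text and turns spaces into "+".
    return "https://search.naver.com/search.naver?" + urlencode(params)


url = build_blog_search_url("강북구", "맛집")
print(url)

# Round trip: decoding the query string gives back the original text.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"])
```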

🧪 Jupyter usage example

# Extract a blog post body with Selenium
from naver_crawler.naver_blog_scrap_selenium import crawl_naver_blog
crawl_naver_blog("https://blog.naver.com/gangbuk_official/223177144192")

# Same post, using requests
from naver_crawler.naver_blog_scrap_request import crawl_naver_blog_by_requests
crawl_naver_blog_by_requests("https://blog.naver.com/gangbuk_official/223177144192")

# Extract a list of review links ("강북구" is a district name, "맛집" means "good restaurants")
from naver_crawler.naver_blog_scrap_request import find_review_article
find_review_article("강북구", "맛집")

🖼️ 1) Jupyter in VS Code

🖼️ 2) Jupyter in the browser

jupyter notebook --no-browser --port 8888

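✅ Summary

Static HTML/JSON → requests
HTML parsing → BeautifulSoup
JS rendering required → selenium
iframe handling → extract the iframe URL and request it separately, or switch into the frame with Selenium
Text cleaning → use the clean_text() function

When a blog post body sits inside an iframe and also depends on JS rendering, Selenium is the most dependable option.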
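Since requests is cheaper and Selenium is more dependable, one way to combine them is to try the cheaper approach first and fall back only when it fails. The sketch below is one possible arrangement, not part of the original notes; `crawl_with_fallback` and the two stand-in fetchers are hypothetical names, and in practice the list would hold `crawl_naver_blog_by_requests` and `crawl_naver_blog`.

```python
def crawl_with_fallback(url, fetchers):
    """Try each fetcher in order and return the first non-empty result."""
    errors = []
    for fetch in fetchers:
        try:
            text = fetch(url)
        except Exception as exc:  # record the failure and move on
            errors.append((fetch.__name__, repr(exc)))
            continue
        if text:
            return text
    raise RuntimeError(f"all fetchers failed for {url}: {errors}")


# Stand-in fetchers so the sketch runs without network access.
def fake_requests_fetch(url):
    raise ValueError("post body not found")


def fake_selenium_fetch(url):
    return "post body text"


result = crawl_with_fallback(
    "https://blog.naver.com/example_blog/123",
    [fake_requests_fetch, fake_selenium_fetch],
)
print(result)  # post body text
```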
